Data enrichment and matching

ABSTRACT

In an embodiment, a process for data enrichment and matching includes obtaining a first dataset associated with a first user from a first data source, where the first dataset includes records from a structured data source, and obtaining a corresponding second dataset associated with a second user. The process includes enriching at least one of the first dataset and the second dataset. The process includes merging the first dataset and the second dataset including by matching a set of attributes based at least in part on matching corresponding attributes, wherein at least one of the first dataset and the second dataset has been enriched. The process includes outputting the merged data.

CROSS REFERENCE TO OTHER APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/081,929 entitled DATA ENRICHMENT AND MATCHING filed Oct. 27, 2020, which claims priority to U.S. Provisional Patent Application No. 62/928,936 entitled DATA ENRICHMENT AND MATCHING filed Oct. 31, 2019, both of which are incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

Data storage systems such as customer relationship management (CRM) systems are useful for managing data records corresponding to an organization's associates such as customers or prospective customers. Many different data storage systems exist today. Reconciling data from different systems is technically challenging because different storage systems may store datasets in different ways and use different structuring paradigms. Conventional techniques for combining distinct datasets are typically inaccurate or inefficient.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 is a flow diagram illustrating an embodiment of a process for data enrichment and matching.

FIG. 2 is a block diagram illustrating an embodiment of a system for data enrichment and matching.

FIG. 3 is a block diagram illustrating an embodiment of a system for data enrichment and matching.

FIG. 4 shows an example of a graphical user interface for data enrichment and matching in a state for creating a dataset.

FIG. 5 shows an example of a graphical user interface for data enrichment and matching in a state after importing data prior to enrichment.

FIG. 6 shows an example of a graphical user interface for data enrichment and matching while enrichment is in progress.

FIG. 7 shows an example of a graphical user interface for data enrichment and matching while enrichment is in progress.

FIG. 8 shows an example of a graphical user interface for data enrichment and matching in a state for selecting datasets to match.

FIG. 9 shows an example of a graphical user interface for data enrichment and matching in a state for matching results of datasets.

FIG. 10 shows an example of a graphical user interface for data enrichment and matching in a state for matching results of datasets.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

Examples of contexts in which distinct datasets may be desired to be merged, compared, cross-collated, or the like include mergers, acquisitions, transitions to new data storage and management platforms, joint ventures, and other partnerships. For example, two organizations entering into a joint or other cooperative business venture may want to know which customers they have in common or which customers (e.g., of a certain type, having a specified attribute) of theirs or of the other organization are not common to the two organizations (e.g., are “distinct” to one or the other). They might also want to know who could be new potential customers based on their common or distinct customers.

Conventional approaches typically identify common or distinct records in datasets between two organizations as follows. A first user (associated with a first organization) exports data into a CSV or a Microsoft Excel® format and shares the exported data with a second user (associated with a second organization). The second user then compares the first user's data with the second user's data. The second user's data could be in a CSV, Excel®, or other report format. The second user typically performs the comparison manually row-by-row or using a script written in a programming language such as Python. Thus, conventional methods for comparing distinct datasets are laborious, requiring extensive user or programmer involvement.

Techniques are disclosed to import, enrich, match, and securely/selectively share data from distributed/disparate data sources. In various embodiments, data from two or more datasets, e.g., sales account data or other data from two organizations in a joint or other cooperative venture, are imported, normalized, enriched, etc. and records from the datasets are compared to identify common and/or distinct records.

In various embodiments, a process for data enrichment and matching includes obtaining a first dataset associated with a first user from a first data source, where the first dataset includes entities associated with records from a structured data source (such as a customer relationship management system, marketing automation system, or other structured data storage with enterprise/business logic). The process obtains a corresponding second dataset associated with a second user. The process enriches the first dataset and/or the second dataset, and merges the enriched first and second datasets including by matching a set of attributes. The process then outputs the merged data. The merged data refers to a resulting dataset that matches criteria set in the attributes and can be obtained from at least a subset of different datasets used in the process. Distinct datasets (e.g., datasets from distinct enterprises and/or other distinct data sources) can be merged and/or compared, while providing granular control of which data is seen by which user(s) (e.g., distinct versus duplicate customers, own data versus other's data, subset of a data record, etc.).

FIG. 1 is a flow diagram illustrating an embodiment of a process for data enrichment and matching. This process may be implemented by system 200 of FIG. 2 or system 300 of FIG. 3.

The process begins by obtaining a first dataset associated with a first user from a first data source (100). The first dataset includes entities associated with records from a structured data source such as one with enterprise/business logic, e.g., customer relationship management, marketing automation system, or the like. An example of a dataset is further described with respect to FIG. 4.

In various embodiments, a dataset can be imported or otherwise obtained from a data store or third-party provider. The dataset can be prepared by a user as follows. A user (e.g., the first user or another user associated with the first user) imports data that is available in various disparate data sources within their organization into a software application (such as system 200 of FIG. 2 or system 300 of FIG. 3). To import data, in some embodiments, the user does one or more of the following: leverages a native integration built between the application and data source; leverages a third-party integration that is available; and/or imports data from a report that has been generated from the data source.

This data report can be available in either a structured or an unstructured data format. In the case of unstructured data, certain data transformations can be performed that will convert the unstructured data into a structured data format. Once structured data is available, the data attributes to be imported from the report or via the integration are mapped to the data attributes in the application. In cases where certain data attributes are not directly present in the application, custom data objects/attributes can be created, thereby giving the flexibility to map data attributes and objects not present in the application. Such imported data can either be directly consumed in a host application or can be stored as a “dataset” for processing by this process. In various embodiments, data is imported from distributed data sources/systems.
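By way of illustration only, the attribute mapping described above might be sketched as follows. The column names, the `ATTRIBUTE_MAP` table, and the `import_rows` helper are hypothetical and do not come from the disclosure; unmapped columns are kept as custom attributes rather than dropped.

```python
# Hypothetical mapping from report column names to application attribute names.
ATTRIBUTE_MAP = {"Acct Name": "account_name", "HQ City": "city"}


def import_rows(report_rows, attribute_map):
    """Rename mapped columns; keep unmapped columns as custom attributes."""
    dataset = []
    for row in report_rows:
        record = {"custom": {}}
        for column, value in row.items():
            if column in attribute_map:
                record[attribute_map[column]] = value
            else:
                # No matching application attribute: store as a custom attribute.
                record["custom"][column] = value
        dataset.append(record)
    return dataset


report = [{"Acct Name": "Company A", "HQ City": "Austin", "Tier": "Gold"}]
dataset = import_rows(report, ATTRIBUTE_MAP)
```

In this sketch the "Tier" column has no counterpart in the application, so it is preserved under the record's custom attributes.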

The process obtains a corresponding second dataset associated with a second user (102). The second dataset can be prepared and obtained in the same way as the first dataset. In various embodiments, the second user belongs to an organization outside the organization associated with the first user. For example, in a merger between a first company and a second company, the first user belongs to a first company, and the second user belongs to a second company.

The process enriches at least one of the first dataset and the second dataset (104). Enrichment enables higher fidelity matching between the two datasets. The data is enriched and/or augmented, either for the existing data attributes or by adding new data attributes, to create normalized data.

Data that is imported at 100 and 102 may not be complete and/or correct. This could be due to lack of data at the source or because of human error (e.g., typographical errors). In cases where data is not complete or not correct, in various embodiments the data is enriched/augmented in the application after importing from the data source and creating a “dataset”. The data enrichment process in some embodiments includes taking either a single row of data or a single attribute of the data (e.g., account name) in a row of data from the dataset in the host application and checking the data value in publicly and privately available data sources that are configured by the application. The data sources can rest outside the host application, be available within the host application, or within another system hosted by the organization that is enriching the data. Once a data match is established, relevant attributes are retrieved from the publicly/privately available data sources. One can also find a match in a public/private data source, retrieve certain data values from the data source for the matching record, then go to another public/private data source and attempt to retrieve additional data attributes/values. Thus, the original data record gets enriched and augmented with additional data values that either did not exist earlier or were incorrect values. This increases accuracy and makes it easier to match data.
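As a minimal sketch of the chained lookup described above, and not a definitive implementation, the following assumes each data source can be modeled as an in-memory dictionary keyed by account name. The sources, the attribute names, and the `enrich_record` helper are hypothetical. The sketch fills in missing values only; correcting incorrect values would require an additional rule for deciding which source to trust.

```python
def enrich_record(record, sources, key="account_name"):
    """Fill missing (None or empty) attributes from each source in turn."""
    enriched = dict(record)
    for source in sources:
        match = source.get(enriched.get(key))
        if match is None:
            continue  # No matching record in this source; try the next one.
        for attribute, value in match.items():
            if not enriched.get(attribute):
                enriched[attribute] = value
    return enriched


# Hypothetical public/private sources, modeled as dictionaries.
SOURCE_A = {"Company A": {"city": "Austin"}}
SOURCE_B = {"Company A": {"city": "Dallas", "state": "TX"}}

record = {"account_name": "Company A", "city": None}
enriched = enrich_record(record, [SOURCE_A, SOURCE_B])
```

Here the first source supplies the city and the second source supplies only the state, because the city has already been filled in by the time the second source is consulted.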

The process merges the enriched first dataset and the enriched second dataset including by matching a set of attributes based at least in part on matching corresponding attributes (104). Normalized data is used to identify close (e.g., fuzzy) matches/duplicates with data that is present in another dataset(s) thereby surfacing common and distinct data. In various embodiments, the first and/or second user can specify match criteria via a graphical user interface as further described below. In various embodiments, the process determines first filtered data associated with the first user and second filtered data associated with the second user, where the first filtered data and the second filtered data match filter criteria associated with the first dataset and the corresponding second dataset. This offers privacy or security benefits because it gives users control over what information gets shared with other users or implements privacy/security policies.
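One possible way to identify close (fuzzy) matches on normalized data is sketched below using the `difflib` module of the Python standard library. The disclosure does not specify a matching algorithm; the suffix list, the 0.85 threshold, and the helper names are assumptions chosen for illustration.

```python
import difflib
import re

# Assumed list of company suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "co"}


def normalize(name):
    """Lowercase, strip punctuation, and drop common company suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(word for word in words if word not in SUFFIXES)


def is_fuzzy_match(a, b, threshold=0.85):
    """Return True when the normalized names are similar enough."""
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold
```

Under these assumptions, "Acme Corp." and "ACME Corporation" normalize to the same string and match, while a lower threshold would tolerate more typographical variation at the cost of more false matches.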

The merge is performed taking into account the specified match criteria to filter and obtain matching data. For example, a user can specify that “Account” in one dataset maps to “Acct Name” in a second dataset. Using this information, data in column “Account” can be merged with data in column “Acct Name.” Filtering can be performed by row or column. The process can suggest some filtering criteria such as what information should be shared between two users. The information to be shared can depend on confidentiality, privacy, or security considerations.
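The “Account” to “Acct Name” example can be sketched as follows. This is a simplified illustration that assumes exact equality on the mapped columns and a caller-supplied list of columns to share; the function name and sample rows are hypothetical.

```python
def match_on_mapped_columns(first_rows, second_rows, column_map, shared_columns):
    """Pair rows whose mapped columns are equal; return only shared columns."""
    matches = []
    for left in first_rows:
        for right in second_rows:
            if all(left.get(a) == right.get(b) for a, b in column_map.items()):
                merged = {**right, **left}
                # Column filtering: expose only the columns chosen for sharing.
                matches.append(
                    {column: merged[column] for column in shared_columns if column in merged}
                )
    return matches


first = [
    {"Account": "Company A", "Owner": "User 1"},
    {"Account": "Company B", "Owner": "User 1"},
]
second = [{"Acct Name": "Company A", "Region": "West"}]

result = match_on_mapped_columns(
    first, second, {"Account": "Acct Name"}, ["Account", "Region"]
)
```

In this sketch the "Owner" column is withheld from the output even though the row matched, which corresponds to filtering by column.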

In various embodiments, after importing (and creating) datasets at 100 and 102, the user can share the dataset with another user either within the organization or outside the organization. The user can set certain permissions and privileges on how other users can access/view/perform operations on the data in the dataset or the dataset in its entirety. For example, the user who created (e.g., imported) the dataset can share just the name of the dataset and not necessarily the contents of the dataset. In another case, the user can share the name of the dataset, a few user-selected columns of data in the dataset and not all the columns in the dataset. In another case the user might set permissions that only reveal some or all columns of the data when certain operations (e.g., matching) are performed on the dataset.
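The three sharing cases above (name only, selected columns, and columns revealed only during matching) could be modeled as in the following sketch. The `SharePolicy` class, the `visible_view` helper, and the "match" operation label are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SharePolicy:
    """Columns another user may see, always or only during matching."""
    always_visible: set = field(default_factory=set)
    visible_on_match: set = field(default_factory=set)


def visible_view(dataset_name, rows, policy, operation=None):
    """Return the dataset name plus only the columns the policy allows."""
    columns = set(policy.always_visible)
    if operation == "match":
        columns |= policy.visible_on_match
    if not columns:
        return {"name": dataset_name, "rows": []}  # Name-only sharing.
    filtered = [{c: v for c, v in row.items() if c in columns} for row in rows]
    return {"name": dataset_name, "rows": filtered}


rows = [{"Account": "Company A", "City": "Austin", "Revenue": 10}]
policy = SharePolicy(always_visible={"Account"}, visible_on_match={"City"})
browse_view = visible_view("sample dataset 1", rows, policy)
match_view = visible_view("sample dataset 1", rows, policy, operation="match")
```

An empty policy shares only the dataset name, while the policy shown reveals "City" only when a matching operation is performed.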

In various embodiments, merging the enriched first dataset and the enriched second dataset includes combining the enriched first dataset and the enriched second dataset, comparing the enriched first dataset to the enriched second dataset, and generating metadata based on the comparison of the enriched first dataset to the enriched second dataset. For example, the metadata contains information (e.g., a flag) about whether data is common (e.g., similar or the same) to both datasets or distinct to one of the datasets.
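A minimal sketch of the combine, compare, and flag steps follows. It assumes records are compared by exact equality on a single key attribute; the `merge_with_flags` name and the "status" and "sources" metadata fields are hypothetical.

```python
def merge_with_flags(first_rows, second_rows, key):
    """Combine two datasets and flag each key value as common or distinct."""
    first_keys = {row[key] for row in first_rows}
    second_keys = {row[key] for row in second_rows}
    merged = []
    for value in sorted(first_keys | second_keys):
        in_first = value in first_keys
        in_second = value in second_keys
        sources = []
        if in_first:
            sources.append("first")
        if in_second:
            sources.append("second")
        merged.append({
            key: value,
            # Metadata generated from the comparison of the two datasets.
            "status": "common" if in_first and in_second else "distinct",
            "sources": sources,
        })
    return merged


first = [{"account": "Company A"}, {"account": "Company B"}]
second = [{"account": "Company B"}, {"account": "Company C"}]
merged = merge_with_flags(first, second, "account")
```

With these inputs, "Company B" is flagged as common and the other two accounts are flagged as distinct, each with its originating dataset.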

Once common or distinct data is identified, the merged data can be output in the form of a list such as a list of new potential customers. This is done in some embodiments by suggesting competitors of current customers (either common or distinct). The list of competitors of current customers is generated in some embodiments by leveraging specific techniques like web scraping or comparing it against a database of companies and their competitors in their industries or in other industries.

The process outputs the merged data (106). The merged data can be output to storage, an analysis engine, or for rendering on a graphical user interface, some examples of which are further discussed below. To match data that is available in two or more datasets, in various embodiments, the user provides the datasets that need to undergo the matching exercise and can decide to match the datasets in their entirety or handpick the attributes (e.g., column names in a dataset) of the datasets that need to be matched. The application then compares these two datasets according to the instructions provided by the user and looks for matching (or nearly matching) data values. The application can provide a summary of common data records and/or distinct data records. After matching the datasets, either all records can be displayed or the records can be displayed based on the permissions/privileges set by the user.

In various embodiments, initial merged results are stored in an intermediate storage. One or more users can interact with the initial results, for example selecting a subset. The selected subset then gets stored in a (final) merged data store. For example, initially 10 rows of merged data are output, a first user selects five of the rows and a second user selects two additional rows, and those seven rows get stored in a final merged data store. One or more users can further interact with the data stored in the final merged data store, for example downloading the final dataset to perform actions offline, storing it in an associated profile in a Web application, or pushing the data into an external data store. The intermediate data store can improve the speed and performance of the system because not all of the data in the intermediate data store is of interest to the users.
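The ten-row example above can be sketched as follows, with the intermediate storage modeled as a dictionary keyed by a row identifier. The row identifiers, the row contents, and the `finalize` helper are hypothetical.

```python
# Intermediate (staging) results: ten merged rows keyed by a row identifier.
staging = {row_id: {"account": f"Company {row_id}"} for row_id in range(1, 11)}


def finalize(staging_rows, *selections):
    """Store only the rows selected by any user in the final merged set."""
    chosen = set().union(*selections)
    return [staging_rows[row_id] for row_id in sorted(chosen) if row_id in staging_rows]


first_user_selection = {1, 2, 3, 4, 5}
second_user_selection = {6, 7}
final_store = finalize(staging, first_user_selection, second_user_selection)
```

Only the seven selected rows reach the final store in this sketch; the remaining three rows stay in the intermediate storage.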

FIG. 2 is a block diagram illustrating an embodiment of a system for data enrichment and matching. As will be apparent, other computer system architectures and configurations can be used to perform data enrichment and matching. Computer system 200, which includes various subsystems as described below, includes at least one microprocessor subsystem (also referred to as a merging engine or a processor) 204. For example, system 200 can be implemented by a single-chip processor or by multiple processors. In some embodiments, system 200 is a general purpose digital processor. Using instructions retrieved from memory, the system controls the reception and manipulation of input data, and the output and display of data on output devices. In some embodiments, system 200 executes/performs the process of FIG. 1.

Various embodiments disclosed herein further relate to computer storage products with a computer readable medium that includes program code for performing various computer-implemented operations. The computer-readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of computer-readable media include, but are not limited to, all the media mentioned above: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as optical disks; and specially configured hardware devices such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and ROM and RAM devices. Examples of program code include both machine code, as produced, for example, by a compiler, or files containing higher level code (e.g., script) that can be executed using an interpreter.

An example of system 200 is accessed via a Web application by Pronto Technology, Inc. System 200 is communicatively coupled to a first client 210 and a second client 220. A user of the first client (“first user”) and a user of the second client (“second user”) can each interact with system 200 to merge or compare data belonging to the first user and the second user. System 200 is configured to perform the process of FIG. 1 to compare or merge distinct datasets.

System 200 includes communications interface 202 and merging engine 204. Data associated with the first user or second user may be kept in a remote location such as raw data store 230 and third party data providers 240. Communications interface 202 is configured to obtain data from raw data store 230 and third party data providers 240. Examples of raw data stores include SalesForce® and Microsoft Dynamics®. Examples of third party data providers include Crunchbase® and ZoomInfo®. An example of communications interface 202 is a Web portal, which can be implemented by a Django® Web framework for example.

Merging engine 204 is configured to compare and/or merge data obtained by the communications interface 202 by performing the process of FIG. 1. The merging engine 204 outputs intermediate results to temporary staging data store 250, which may include unstructured data. One advantage of using a structureless database for temporary staging data is that there is no limit on the data. The structure of data can be preserved in such a structureless database using key-values. After merging/comparison is complete, the merging engine outputs merged data to merged data store 260. An example of temporary staging data store 250 is MongoDB®. An example of merged data store 260 is Postgres®.

System 200 can query one or more of the stores 230, 250, 260, or data providers 240 periodically or on-demand to ensure that data remains fresh.

FIG. 3 is a block diagram illustrating an embodiment of a system for data enrichment and matching. Each of the components is like its counterpart in FIG. 2 unless otherwise described. The system 300 includes enrichment engine 306. System 200 calls a third party to perform enrichment or receives data that has already been enriched (e.g., from 240) in various embodiments. By contrast, system 300 can perform enrichment locally.

Enrichment engine 306 is configured to enrich data for example by performing 104 of FIG. 1. The enrichment engine can enrich specific rows within a table by updating the row to replace missing or incorrect information. The enrichment engine can also perform additional intelligence related to the data in specific rows, for example noticing a pattern.

The following figures show examples of graphical user interfaces enabling a user to interact with the process of FIG. 1.

FIG. 4 shows an example of a graphical user interface for data enrichment and matching in a state for creating a dataset. This example shows a process of importing data for a user 404. The user interface includes a menu 402 for navigating the Web portal.

Panel 410 shows a current company (Acme Corp) for which a dataset is being retrieved. Associated customers, accounts, or datasets can be displayed by selecting the appropriate links. The information can be displayed in sorted order (alphabetical in this example) and provide options to the user to sort in another way. For example, clicking the arrow next to “Account Name” will cause the accounts to be sorted in reverse alphabetical order.

In this example, the user interface is prompting the user to select accounts from the displayed results to create one or more datasets. The user can select a checkbox corresponding to data to be added to a dataset, and click the “create dataset” button to form a dataset made up of the selected accounts. In this example, no accounts have been selected yet.

FIG. 5 shows an example of a graphical user interface for data enrichment and matching in a state after importing data prior to enrichment. Each of the components is like its counterpart in FIG. 4 unless otherwise described.

In this example, the dataset is made up of Company A to Company D (and some others not shown here but that can be displayed using the scroll bar). A user can further refine this dataset by selecting an entry to remove the account from the current dataset. As shown, the dataset is displayed along with its owner (User 1), company (Acme Corp), the time of creation (21 Oct. 2019) and last updated (21 Oct. 2019). The dataset can have an associated expiration date (30 days from now in this example) to ensure data freshness, comply with privacy or security regulations, etc.

The user interface includes an “enrich” link 530 that triggers an enrichment process or step (e.g., 104). This link 530 can be accompanied by or replaced by an icon indicating a state of enrichment. For example, prior to enrichment, text or a color in the icon can indicate that the dataset has not been enriched. For example, the icon is grey prior to enrichment and after enrichment, the icon becomes green.

FIG. 6 shows an example of a graphical user interface for data enrichment and matching while enrichment is in progress. Each of the components is like its counterpart in FIG. 5 unless otherwise described. Element 602 indicates that enrichment is in progress. In this state, enrichment has just begun so none of the blank (--) fields in section 604 have been updated yet. At the end of enrichment, the blank fields will be populated with a value if one is found.

FIG. 7 shows an example of a graphical user interface for data enrichment and matching while enrichment is in progress. Each of the components is like its counterpart in FIG. 6 unless otherwise described. As shown, the missing street names, cities, and states have been found and filled out in section 704. Element 702 indicates that enrichment is complete.

FIG. 8 shows an example of a graphical user interface for data enrichment and matching in a state for selecting datasets to match. Each of the components is like its counterpart in FIG. 7 unless otherwise described. A user from a first organization can set permissions so that a user from a second organization can select datasets belonging to the first organization, enrich the datasets, etc. as shown.

In this example, Widget Inc. and Acme Corp want to merge/compare their datasets. As shown, each organization has respective datasets. The datasets are displayed along with the users who have permission to work with each dataset, owners, number of accounts, enrichment state, and expiration date. In this example, a user selects “sample dataset 1” from Widget Inc. and “sample dataset 2” from Acme Corp in order to match/merge/compare these two datasets.

FIG. 9 shows an example of a graphical user interface for data enrichment and matching in a state for matching results of datasets. Each of the components is like its counterpart in FIG. 8 unless otherwise described. As shown, each entry is displayed with whether it is “common” meaning both Widget Inc. and Acme Corp had this data or “distinct” meaning only one of the organizations had this data.

FIG. 10 shows an example of a graphical user interface for data enrichment and matching in a state for matching results of datasets. Each of the components is like its counterpart in FIG. 9 unless otherwise described. This user interface conveys common/distinct data in a different way. The source box below the headings shows the source of the data. For example, some of the data came from source “Src 1” while other data came from source “Src 2”. Example sources include Dropbox® and BetterCloud®.

As shown, each entry is displayed with whether it is “common” meaning both Widget Inc. and Acme Corp had this data or “distinct” meaning only one of the organizations had this data.

The disclosed data enrichment and matching techniques have many advantages over conventional techniques. In one aspect, overlaps of data between companies (distinct datasets) can be identified whereas typically data is enriched internally to a company. In another aspect, data sources with different data formats (e.g., live data vs. flat file) can be merged. In yet another aspect, datasets for two or more organizations can be merged, unlike conventional techniques, which can merge datasets for only two organizations or within a single organization.

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

What is claimed is:
 1. A system, comprising: a communication interface configured to: obtain a first dataset associated with a first user from a first data source, wherein the first dataset includes records from a structured data source; and obtain a corresponding second dataset associated with a second user; a processor coupled to the communication interface and configured to: enrich at least one of the first dataset and the second dataset; create an intermediate dataset based at least in part on the enriched at least one of the first dataset and the second dataset, wherein the intermediate dataset includes unstructured data; merge the first dataset and the second dataset including by matching a set of attributes based at least in part on matching corresponding attributes based at least in part on the intermediate dataset; and output the merged dataset, wherein the merged dataset includes structured data.
 2. The system of claim 1, wherein the structured data source includes a customer relationship management system.
 3. The system of claim 1, wherein the processor is further configured to: determine first filtered data associated with the first user and second filtered data associated with the second user, wherein the first filtered data and the second filtered data match filter criteria associated with the first dataset and the corresponding second dataset.
 4. The system of claim 3, wherein at least one of the filter criteria is specified by at least one of the first user and the second user.
 5. The system of claim 3, wherein at least one of the filter criteria is pre-defined based at least in part on a security policy.
 6. The system of claim 1, wherein the obtained first dataset includes fields specified by the first user to be exposed to other users such that only those fields specified by the first users are displayable to other users.
 7. The system of claim 6, wherein outputting the merged dataset includes displaying matching records or common customers that neither the first user nor the second user has.
 8. The system of claim 6, wherein the fields include a subset of columns in the first dataset.
 9. The system of claim 1, wherein enriching at least one of the first dataset and the second dataset includes augmenting at least one of the first dataset and the second dataset with data from another source.
 10. The system of claim 1, wherein enriching at least one of the first dataset and the second dataset includes at least one of: filling in missing information and correcting errors.
 11. The system of claim 1, wherein enriching at least one of the first dataset and the second dataset is performed for at least one row in at least one of the first dataset and the second dataset.
 12. The system of claim 1, wherein enriching at least one of the first dataset and the second dataset is performed for at least one attribute in data of at least one of the first dataset and the second dataset.
 13. The system of claim 1, wherein enriching at least one of the first dataset and the second dataset includes normalizing data.
 14. The system of claim 1, wherein merging the first data set and the second dataset includes: combining the first data set and the second dataset; comparing the first data set to the second dataset; and generating metadata based on the comparison of the first data set to the second dataset.
 15. The system of claim 14, wherein the metadata includes information about at least one of: commonalities and distinctions in the first data set and the second dataset.
 16. The system of claim 1, wherein outputting the merged dataset includes displaying the merged dataset on a graphical user interface.
 17. The system of claim 1, wherein the processor is further configured to: receive user input on the merged data; update the merged data based on the received user input; and store the updated merged data in a structured data store.
 18. The system of claim 1, wherein the first user is associated with a first organization and the second user is associated with a second organization different from the first organization.
 19. A method, comprising: obtaining a first dataset associated with a first user from a first data source, wherein the first dataset includes records from a structured data source; obtaining a corresponding second dataset associated with a second user; enriching at least one of the first dataset and the second dataset; creating an intermediate dataset based at least in part on the enriched at least one of the first dataset and the second dataset, wherein the intermediate dataset includes unstructured data; merging the first dataset and the second dataset including by matching a set of attributes based at least in part on matching corresponding attributes based at least in part on the intermediate dataset; and outputting the merged dataset, wherein the merged dataset includes structured data.
 20. A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions for: obtaining a first dataset associated with a first user from a first data source, wherein the first dataset includes records from a structured data source; obtaining a corresponding second dataset associated with a second user; enriching at least one of the first dataset and the second dataset; creating an intermediate dataset based at least in part on the enriched at least one of the first dataset and the second dataset, wherein the intermediate dataset includes unstructured data; merging the first dataset and the second dataset including by matching a set of attributes based at least in part on matching corresponding attributes based at least in part on the intermediate dataset; and outputting the merged dataset, wherein the merged dataset includes structured data.